Abstract
Background: Baseline and outcome data for patients undergoing hematopoietic stem cell transplantation (HCT) are required to be reported to the Center for International Blood and Marrow Transplant Research (CIBMTR). This process is time-consuming and relies on trained data managers. Large language models (LLMs) may help streamline this workflow. In this study, we developed and tested locally deployable LLMs that can run within hospital systems, without sending patient data outside the institution. Our goals were threefold: (1) to evaluate LLM performance in extracting CIBMTR-required fields, (2) to assess their ability to extract additional research-specific data, and (3) to improve model performance through fine-tuning on medical notes.
Methods: We obtained inpatient charts from 175 patients who underwent allogeneic HCT between 1/2014 and 12/2023 at the Ohio State University Comprehensive Cancer Center. To evaluate CIBMTR field extraction, we selected 16 commonly reported variables, such as transplant date, disease category, performance status, HLA match, donor type, ABO type, CMV serostatus, conditioning regimen, GVHD prophylaxis regimen, and key demographic variables, and provided each model with admission and discharge summaries. Extraction outputs were compared to data entered by experienced data managers. For research data extraction, we examined patients at risk of cytokine release syndrome (CRS) after HCT, with a focus on the use of tocilizumab for treatment. Models were given transplant D0 to D+5 progress notes and discharge summaries, and asked to extract 14 clinically relevant fields, including fever onset and resolution dates, maximum temperature, tocilizumab use and dose, and presence of neurotoxicity. Extracted values were compared to those manually curated by a hematology fellow. To improve LLM performance, we created a transplant-focused dataset from publicly available clinical notes and used it to train a customized version of the Gemma-3 model, which we refer to as Gemma-BMT. Data extraction was completed using two NVIDIA A6000 GPUs.
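The field-level comparison of model outputs to manually curated labels can be sketched as follows. This is a minimal illustration with hypothetical field names, values, and matching rules (exact string match; a wrong value scored as both a false positive and a false negative); the study's actual scoring procedure may differ.

```python
# Hypothetical sketch of per-field extraction scoring against gold labels.
# Assumes exact-match comparison; fields absent from the chart are gold None.

def score_extraction(gold: dict, pred: dict):
    tp = fp = fn = 0
    for field, gold_val in gold.items():
        pred_val = pred.get(field)
        if gold_val is None:          # field not documented in the chart
            if pred_val is not None:
                fp += 1               # value extracted where none exists
        elif pred_val is None:
            fn += 1                   # documented value missed by the model
        elif pred_val == gold_val:
            tp += 1                   # exact match with the curated label
        else:
            fp += 1                   # wrong value: penalized twice,
            fn += 1                   # as both a spurious and a missed extraction
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical single-chart example (field names for illustration only):
gold = {"donor_type": "MUD", "abo_type": "A+", "cmv_serostatus": None}
pred = {"donor_type": "MUD", "abo_type": "O+", "cmv_serostatus": None}
p, r, f1 = score_extraction(gold, pred)
```

Aggregating counts of this kind across all charts and fields yields the micro-averaged F1 scores reported in the Results.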
Results: For CIBMTR field extraction, we evaluated four off-the-shelf local models: LLaMA3.1-70B, Qwen3-30B, Gemma3-27B, and MedGemma-27B. Across 2,800 total data fields, these models achieved correct extraction rates of 72%, 74%, 78%, and 79%, respectively, with corresponding F1 scores of 0.81, 0.82, 0.84, and 0.84. Highest performance was observed for reliably documented variables such as gender, stem cell source, conditioning regimen, and GVHD prophylaxis (F1 > 0.95). In contrast, accuracy was lower for fields with more variable documentation, including donor ABO type, CMV serostatus, and performance status. Extraction of 175 patient charts was completed in under one hour.
For research field extraction, model performance was more variable. F1 scores ranged from 0.72 to 0.87, with MedGemma consistently outperforming the others. It achieved near-perfect accuracy (F1 > 0.95) on well-structured binary variables such as CRS presence, neurotoxicity, and tocilizumab use. It also performed well on structured temporal and numeric fields, including maximum temperature and tocilizumab start/stop dates (F1 > 0.80). Performance declined for fields with inconsistent or missing documentation, such as fever onset and resolution dates. Full extraction was completed in under five hours.
We also evaluated a custom fine-tuned model, Gemma-BMT, trained on transplant-specific notes. On CIBMTR field extraction, Gemma-BMT achieved a 75% correct extraction rate and the highest F1 score (0.85) among all models. For research fields, it correctly extracted 81% of data points and matched MedGemma's top F1 score of 0.87, slightly outperforming it in overall accuracy.
Conclusions: Locally deployable LLMs can accurately extract both CIBMTR-required and research-specific clinical data from patient charts, potentially reducing the burden of manual data abstraction in allogeneic HCT. Model performance was highest for clearly documented clinical fields and improved further with domain-specific fine-tuning. These results support the feasibility of integrating LLMs into institutional data workflows while preserving patient privacy.